[Intel NPU] Add Windows & Linux Intel NPU support #1171

Open
Looong01 wants to merge 36 commits into lightvector:master from Looong01:Intel_NPU

@Looong01 Looong01 commented Mar 16, 2026

Summary

This PR adds and hardens the Windows & Linux Intel NPU path for KataGo using the ONNX backend with ONNX Runtime + OpenVINO Execution Provider, and updates docs/config guidance for an end-to-end workflow.

It also improves failure behavior for non-ONNX builds and simplifies Windows & Linux dependency handling.

What Changed

1) ONNX backend and OpenVINO provider support

  • Added/updated ONNX Runtime provider selection via onnxProvider (cpu, openvino, cuda, tensorrt, migraphx, coreml).
  • Added/updated OpenVINO-specific runtime options:
    • onnxOpenVINODeviceType
    • onnxOpenVINODeviceId
    • onnxOpenVINOCacheDir
    • onnxOpenVINOEnableNPUFastCompile (best-effort; depends on ORT build support)
  • Supports both:
    • loading raw .onnx models directly
    • loading .bin/.bin.gz models via internal conversion to ONNX graph

2) exportonnx command behavior

  • exportonnx is available in ONNX builds and exports fixed-size ONNX models.
  • Default export board size is 19x19 (-x/-y can override).
  • In non-ONNX builds, exportonnx now returns a clear error instead of failing ambiguously.

3) Config safety for non-ONNX binaries

  • In non-ONNX builds, forcing onnx* config keys now fails fast with a clear message.
  • Prevents silent misconfiguration when users accidentally pass ONNX-only config into CUDA/OpenCL/Eigen/etc builds.
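
The fail-fast rule described above can be pictured with a minimal Python sketch. It only illustrates the logic and is not the PR's C++ code: the function name `reject_onnx_keys`, the `backend` argument, and the message wording are all hypothetical.

```python
def reject_onnx_keys(config: dict[str, str], backend: str) -> None:
    """Raise ValueError if a non-ONNX build is given any onnx* config key.

    Hypothetical helper: `backend` stands for the backend the binary was
    built with (e.g. "ONNX", "CUDA", "OPENCL", "EIGEN").
    """
    if backend.upper() == "ONNX":
        return  # ONNX builds accept onnx* keys
    offending = sorted(key for key in config if key.startswith("onnx"))
    if offending:
        raise ValueError(
            f"This binary was built with the {backend} backend; "
            f"ONNX-only config keys are not supported: {', '.join(offending)}"
        )
```

Failing at startup, instead of ignoring the keys, is what turns a silent misconfiguration into an immediate and readable error.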

4) CMake dependency flow

  • Kept ONNX runtime root wiring via ONNXRUNTIME_ROOT (defaulting to cpp/external/onnxruntime-win-x64-openvino and cpp/external/onnxruntime-linux-x64-openvino).
  • Added/updated automatic dependency fetch flow for Windows & Linux builds (zlib, onnx, protobuf) through vcpkg when enabled.
  • ONNX Runtime shared libraries (DLLs on Windows, .so files on Linux) are copied to the output directory during the build.

5) Documentation updates

  • Compiling.md:
    • Added explicit Windows & Linux Intel NPU setup steps:
      • Visual Studio Community or VS 2026 Build Tools (Desktop C++)
      • Intel NPU driver install
      • OpenVINO archive install
      • ONNX Runtime build with OpenVINO EP (use_openvino=NPU)
    • Added the exact file-copy checklist into cpp/external/onnxruntime-win-x64-openvino.
    • Added minimal ONNX backend build command.
  • README.md:
    • Added Intel NPU quick-start section for ONNX/OpenVINO.
    • Added minimal commands for:
      • exportonnx (default 19x19)
      • benchmark
      • gtp

Behavior Notes

  • Multi-device mapping (onnxDeviceToUseThread*) is mainly intended for ONNX providers like CUDA/TensorRT/MIGraphX.
  • OpenVINO Intel NPU usage is typically single-device.

Validation

  • ONNX build compiles successfully on Windows & Linux.
  • exportonnx works from .bin/.bin.gz -> .onnx.
  • benchmark/gtp run with onnxProvider=openvino and onnxOpenVINODeviceType=NPU.
  • Non-ONNX binaries now correctly reject ONNX-only config keys.

@Looong01

Looong01 commented Mar 16, 2026

Author

This is a screenshot of testing in Sabaki:
[Image: Screenshot 2026-03-16 184650]

The binary release is here: https://github.com/Looong01/KataGo-Multi-backends/releases/tag/v1.16.4-openvino

@Looong01

Looong01 commented Mar 16, 2026

Author

I partially referenced the code from #1164, and I am very grateful to @ChinChangYang.

@Looong01 Looong01 changed the title [Intel NPU] Add Windows Intel NPU support [Intel NPU] Add Windows & Linux Intel NPU support Mar 16, 2026
@Looong01

Author

Add Linux support:
screenshot

@foxrainowo

This is wonderful work! I will test this backend in a few days.

@Looong01

Author

I will implement an AMD NPU backend in the coming days.

@foxrainowo

foxrainowo commented Mar 20, 2026

@Looong01

I ran some tests on my device without any issues, and the Intel NPU was used successfully.
Using b28c512nbt, the GPU speed was 18–23 visits/s and the NPU speed was 55–70 visits/s. That’s 2.7 to 3 times faster.
For multi-network matches, the speed reached 1.8 times the original.

My preliminary conclusion is that the NPU backend should not be configured with multiple threads. Its initialization time grows with the number of threads: the more threads, the longer the wait. Multi-threading also slows down the computation itself. For single-game analysis, I use a single thread because it is the fastest and gives the best quality (as shown in the figure below, the speed with 12 threads is very slow because of the initialization). For multi-network matches, I set it to “run two games simultaneously”, because running too many games at once reduces performance.

  1. Do you expect this backend to affect accuracy, or are there any comparative tests on this?

  2. During initialization, it generates many blob files. What are these blob files?

  3. What is the function of these parameters? Can they be automated, and is it necessary for users to modify them?
    onnxInputSpatial = input_spatial
    onnxInputGlobal = input_global
    onnxInputMeta = input_meta
    onnxOutputPolicy = out_policy
    onnxOutputValue = out_value
    onnxOutputMiscvalue = out_miscvalue
    onnxOutputOwnership = out_ownership
    onnxModelVersion = 15

[Figures: NPU_benchmark, NPU_match]

@Looong01

Author

Thank you for testing.

  1. No. I have run many tests, and this backend does NOT affect accuracy.
  2. The blob files are the NPU's compilation cache. The model has to be compiled the first time it is used on the NPU, as with any model running on an NPU. The TensorRT backend likewise generates cache files.
  3. These are low-level engine configuration parameters. Users generally do not need to change them, but they are useful for debugging.

@foxrainowo

Thank you!

I am concerned about the poor performance of multi-threading. As shown in the figure, when the number of threads increases, the computation speed actually decreases. Is this because the NPU itself is not suitable for multi-threading, or is it still possible to optimize multi-threading at this stage?

@kaorahi

kaorahi commented Mar 21, 2026

Contributor

This is amazing on my Linux notebook. I am seeing a 3.5x speedup (87.30 vs 25.16 visits/s) compared to OpenCL, which seems unusually slow on my system. I really appreciate this. As for katago benchmark, it recommends numSearchThreads = 1 in my case as well.

To build ONNX Runtime, I had to downgrade gcc-15 to gcc-14.

CC=gcc-14 CXX=g++-14 CMAKE_PREFIX_PATH=/usr/lib/cmake/openvino2026.0.0 ./build.sh --config Release --use_openvino NPU --build_shared_lib --skip_tests

Also, the source directories seem different from the document, so I used the following commands in zsh.

cd ~/katago/
mkdir -p cpp/external/onnxruntime-linux-x64-openvino/{include,lib/{cmake/onnxruntime,pkgconfig}}
cd cpp/external/onnxruntime-linux-x64-openvino
cp -r ~/onnxruntime/include/onnxruntime/core include/
cp ~/onnxruntime/include/onnxruntime/**/{cpu_provider_factory.h,provider_options.h,onnxruntime_c_api.h,onnxruntime_cxx_api.h,onnxruntime_cxx_inline.h,onnxruntime_env_config_keys.h,onnxruntime_ep_c_api.h,onnxruntime_ep_device_ep_metadata_keys.h,onnxruntime_float16.h,onnxruntime_lite_custom_op.h,onnxruntime_run_options_config_keys.h,onnxruntime_session_options_config_keys.h} include/
cp ~/onnxruntime/build/Linux/Release/**/{libonnxruntime_providers_openvino.so,libonnxruntime_providers_shared.so,libonnxruntime.so.1.*,libonnxruntime.so.1,libonnxruntime.so} lib/
cp ~/onnxruntime/build/Linux/Release/**/{onnxruntimeConfig.cmake,onnxruntimeConfigVersion.cmake,onnxruntimeTargets.cmake,onnxruntimeTargets-release.cmake} lib/cmake/onnxruntime/
cp ~/onnxruntime/build/Linux/Release/**/libonnxruntime.pc lib/pkgconfig/
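
After copying, the resulting layout can be sanity-checked before running CMake. The following is a minimal sketch, assuming the directory structure produced by the commands above; the function name `missing_files` and the choice of which files to check are hypothetical, and the list covers only a subset of the copied files (the versioned `libonnxruntime.so.1.*` name is not checked).

```python
from pathlib import Path

# A subset of the files named in the copy commands above,
# relative to cpp/external/onnxruntime-linux-x64-openvino.
EXPECTED = [
    "include/onnxruntime_c_api.h",
    "include/onnxruntime_cxx_api.h",
    "include/onnxruntime_cxx_inline.h",
    "lib/libonnxruntime.so",
    "lib/libonnxruntime_providers_openvino.so",
    "lib/libonnxruntime_providers_shared.so",
    "lib/cmake/onnxruntime/onnxruntimeConfig.cmake",
    "lib/pkgconfig/libonnxruntime.pc",
]


def missing_files(root: str) -> list[str]:
    """Return the expected files that are absent under `root`, in list order."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


# Example call (run from the KataGo checkout):
#   print(missing_files("cpp/external/onnxruntime-linux-x64-openvino"))
```

An empty list means every checked file is in place; any entries returned point at a copy step that did not match a file.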

@ChinChangYang

Contributor

Claude detected an issue in a Docker container.

Bug: onnxmodelbuilder.cpp fails to compile on Linux/GCC — ONNX_API macro undefined

Error message:

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:42:17:
error: variable 'ONNX_API TableStruct_onnx_2fonnx_2dml_2eproto' has initializer but incomplete type
   42 | struct ONNX_API TableStruct_onnx_2fonnx_2dml_2eproto {
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:43:3:
error: expected primary-expression before 'static'
   43 |   static const uint32_t offsets[];

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:48:51:
error: expected initializer before '_AttributeProto_default_instance_'
   48 | ONNX_API extern AttributeProtoDefaultTypeInternal _AttributeProto_default_instance_;

Root cause:

onnxmodelbuilder.cpp includes <onnx/onnx-ml.pb.h> directly, bypassing onnx/onnx_pb.h which defines the ONNX_API macro. When ONNX_API is undefined, the compiler treats it as an identifier rather than an attribute specifier, breaking the struct/extern declarations in the generated protobuf header.


Reproduction steps:

# 1. Clone and checkout this PR branch
git clone https://github.com/lightvector/KataGo.git
cd KataGo
git fetch origin pull/1171/head:pr-1171
git checkout pr-1171

# 2. Download ORT prebuilt + build onnx_proto and protobuf-lite from source
#    (ONNXRUNTIME_ROOT = prebuilt ORT package dir)
#    (ONNX_INCLUDE_DIR = ort-build/_deps/onnx-build)
#    (ONNX_PROTO_LIB   = ort-build/_deps/onnx-build/libonnx_proto.a)
#    (PROTOBUF_INCLUDE_DIR = ort-build/_deps/protobuf-src/src)
#    (PROTOBUF_LIB     = ort-build/_deps/protobuf-build/libprotobuf-lite.a)

# 3. Configure
mkdir build && cd build
cmake ../cpp \
  -DUSE_BACKEND=ONNX \
  -DKATAGO_AUTO_FETCH_DEPS=OFF \
  -DONNXRUNTIME_ROOT=<ort-prebuilt-dir> \
  -DONNX_INCLUDE_DIR=<ort-build>/_deps/onnx-build \
  -DONNX_PROTO_LIB=<ort-build>/_deps/onnx-build/libonnx_proto.a \
  -DPROTOBUF_INCLUDE_DIR=<ort-build>/_deps/protobuf-src/src \
  -DPROTOBUF_LIB=<ort-build>/_deps/protobuf-build/libprotobuf-lite.a \
  -DCMAKE_CXX_FLAGS="-DONNX_ML"

# 4. Build → fails at onnxmodelbuilder.cpp
cmake --build . -j$(nproc)

System environment:

Item          Value
OS            Linux aarch64
Compiler      GCC 15.2.0
ONNX Runtime  v1.21.0
protobuf      3.21.12 (ORT bundled)
cmake         4.2.3

Fix:

In cpp/neuralnet/onnxmodelbuilder.cpp, change line 9:

-#include <onnx/onnx-ml.pb.h>
+#include <onnx/onnx_pb.h>

onnx_pb.h defines ONNX_API before including onnx-ml.pb.h, resolving the macro issue. Note: when using the new ONNX_INCLUDE_DIR cmake variable, onnx_pb.h must also be present in that directory (it lives in the ONNX source tree, not the build output). See also: ChinChangYang/KataGo#18.
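
For anyone who wants to try the one-line change on a local checkout without editing by hand, here is a minimal sketch. The helper `fix_onnx_include` is hypothetical; it only rewrites a line that consists of exactly the old include, so commented-out or otherwise different lines are left untouched.

```python
OLD_INCLUDE = "#include <onnx/onnx-ml.pb.h>"
NEW_INCLUDE = "#include <onnx/onnx_pb.h>"


def fix_onnx_include(source: str) -> str:
    """Swap the direct protobuf include for the wrapper header, line by line."""
    fixed_lines = []
    for line in source.splitlines(keepends=True):
        if line.strip() == OLD_INCLUDE:
            # Keep the original indentation and line ending.
            line = line.replace(OLD_INCLUDE, NEW_INCLUDE)
        fixed_lines.append(line)
    return "".join(fixed_lines)


# Example (path taken from the fix description above):
#   path = "cpp/neuralnet/onnxmodelbuilder.cpp"
#   text = open(path, encoding="utf-8").read()
#   open(path, "w", encoding="utf-8").write(fix_onnx_include(text))
```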

@Looong01

Looong01 commented Mar 25, 2026

Author

Actually, my CMakeLists.txt already handles this. I use vcpkg to manage these dependencies:

https://github.com/Looong01/KataGo-Multi-backends/blob/115e6daba5f8063fd70d7c89631f123cccced902/cpp/CMakeLists.txt

Or do you still think I need to make this change?

@Looong01

Author

Regarding @foxrainowo's question above about the poor multi-threading performance: because the NPU has a different architecture (totally different from a GPU or CPU), a single thread is enough for it.

@ChinChangYang

Contributor

I think you misunderstood my comment. The reproduction steps fetch #1171, which is exactly this PR, not mine.

@Looong01

Author

But I don't get any error when I compile it. Maybe it only happens with GCC 15?

@ChinChangYang

ChinChangYang commented Mar 26, 2026

Contributor

11433e6 resolves the issue. Thanks.

@kaorahi

kaorahi commented Apr 19, 2026

Contributor

This has been working perfectly for the past month. It would be great to have this feature merged into the official KataGo. Without it, I would have almost had to give up on KataGo after moving to my new PC. Thank you again, @Looong01.

On the Intel Core Ultra 7 255U, OpenCL KataGo is sadly slow, running at less than half the speed of a 5-year-old system with a Core i7-1165G7.

@Looong01

Author

@lightvector

@lightvector

Owner

Thanks, I'll also look at this soon.

@Looong01

Author

Thanks!

@kaorahi

kaorahi commented May 29, 2026

Contributor

Thank you for the updates. b37aa25 works fine with the following minor corrections to the ONNX Runtime Backend (Linux) section of Compiling.md.

  • onnxruntime-win-x64-openvino ==> onnxruntime-linux-x64-openvino
  • build\Linux\Release ==> build/Linux/Release

In my environment, I also needed to downgrade GCC when running ./build.sh:

CC=gcc-14 CXX=g++-14 ./build.sh ...

At the moment, this is the only branch that runs fast enough for practical use in my environment. I would appreciate official support for this.

@foxrainowo

@Looong01 I don't know if this is a problem with the original or with OpenVINO.
[Image: Screenshot 2026-06-17 173405]

@Looong01

Author

This is a DEVICE_LOST from the Intel NPU that occurred mid-inference, after roughly 6 hours of self-play (~22,950 games).

The core error:

L0 zeCommandQueueExecuteCommandLists result: ZE_RESULT_ERROR_DEVICE_LOST,
code 0x70000001 – device hung, reset, was removed, or driver update occurred

L0 refers to Level Zero: the OpenVINO intel_npu plugin talks to the NPU through the Level Zero API. The call chain is:

ONNX Runtime → OpenVINO EP (ov_interface.cc:28) → intel_npu plugin (infer_request.cpp:224) → Level Zero (zero_wrappers.cpp) → device lost.

Both errors (subgraph_4 and subgraph_3) have nearly identical timestamps (15:06:27.2041348 and 15:06:27.2041655, ~30 µs apart), which indicates that this is not a problem with any individual subgraph: the entire NPU device dropped at that instant, so every subgraph running at the time failed simultaneously.
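
The "~30 µs" figure can be checked with exact decimal arithmetic. This is a small sketch; `gap_microseconds` is a hypothetical helper, and it takes only the seconds field of each timestamp because both share the same hour and minute (15:06).

```python
from decimal import Decimal


def gap_microseconds(first_s: str, second_s: str) -> Decimal:
    """Difference between two seconds fields, in microseconds (exact decimals)."""
    return (Decimal(second_s) - Decimal(first_s)) * 1_000_000


# Seconds fields of 15:06:27.2041348 and 15:06:27.2041655
gap = gap_microseconds("27.2041348", "27.2041655")
print(f"{gap:.1f} µs")  # prints: 30.7 µs
```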
Why DEVICE_LOST is triggered

0x70000001 is a fairly generic device-level error. The likely causes, ordered by probability for this scenario:

  1. Driver updated mid-run (most likely)
     The error message itself says "or driver update occurred." Windows Update silently pushes Intel NPU driver updates in the background. If WU updated the NPU driver during the 6-hour run, the device gets re-enumerated and all existing Level Zero contexts/handles are invalidated, causing in-flight inference to fail with device lost.

  2. Long-running resource leak / handle accumulation
     This is ~22,950 games with many inferences each, a very large volume. If the intel_npu plugin or this OpenVINO version leaks memory or handles when repeatedly creating/destroying infer requests, accumulation past some threshold can hang the NPU firmware, triggering a GPU-TDR-style reset. The fact that it crashed after 6 hours rather than immediately is consistent with an accumulation-type issue.

  3. NPU firmware/driver hang (TDR)
     A single inference stalls past the watchdog timeout, the NPU is force-reset, and all subsequent command-queue submissions fail.

  4. Thermal/power-induced reset
     Possible under sustained load, but NPU power draw is low, so this is the least likely.

Suggested investigation

First, rule out the simplest cause, a driver update (PowerShell):

Get-WinEvent -LogName System | Where-Object {
    $_.Message -match "NPU|Intel.*AI Boost|driver"
} | Select-Object TimeCreated, Id, Message -First 20

Focus on whether there were any driver-install / device re-enumeration events around 15:06. You can also check Get-WindowsUpdateLog or the Update history in Settings.

If a driver update is ruled out, other directions:

  • Add auto-restart + error recovery to the self-play loop. This is the most practical fix: after a device lost, the current process generally has to rebuild the ONNX Runtime session (re-initialize the Level Zero context). Simply catching the exception and continuing will likely fail on all subsequent inferences. The most robust approach is to have an outer script detect this error code, kill the process, and relaunch it, resuming from the last SGF/checkpoint.
  • Disable automatic driver updates for the NPU to prevent long-running jobs from being interrupted (disable auto-update for the device in Device Manager, or pause Windows Update).
  • Upgrade OpenVINO / the NPU driver to the latest stable version and re-run, to check whether the leak has been fixed. Reproducing with a short high-frequency stress test is more efficient than blindly running for 6 hours.
  • If you suspect a leak, monitor NPU memory usage during the run to see whether it grows monotonically.
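
The outer-script approach from the first suggestion could look roughly like the sketch below. It rests on several assumptions: that the error text reaches the process's stdout or stderr (it may instead go to a log file, in which case the log would have to be watched), that the relaunched command resumes from its own last checkpoint, and that the helper names `run_once` and `supervise` and the restart limit are purely illustrative.

```python
import subprocess
import time

# Error name taken from the Level Zero message quoted above.
DEVICE_LOST_MARKER = "ZE_RESULT_ERROR_DEVICE_LOST"


def run_once(cmd: list[str]) -> bool:
    """Run cmd, echo its output, and return True if the marker was seen.

    The child process is killed as soon as the marker appears.
    """
    device_lost = False
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            if DEVICE_LOST_MARKER in line:
                device_lost = True
                proc.kill()
                break
    return device_lost


def supervise(cmd: list[str], max_restarts: int = 5, delay_s: float = 10.0) -> int:
    """Relaunch cmd after each device-lost failure; return the restarts used."""
    restarts = 0
    while run_once(cmd) and restarts < max_restarts:
        restarts += 1
        time.sleep(delay_s)  # give the driver time to re-enumerate the device
    return restarts


# Example call; the command line is a placeholder for the actual self-play command:
#   supervise(["katago", "..."], max_restarts=5)
```

Restarting the whole process, instead of catching the exception inside it, follows the reasoning above: a fresh process gets a fresh ONNX Runtime session and Level Zero context.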

5 participants